
    Massively Parallel Single-Source SimRanks in $o(\log n)$ Rounds

    SimRank is one of the most fundamental measures of the structural similarity between two nodes in a graph and has been applied in a plethora of data management tasks. These tasks often involve single-source SimRank computation, which evaluates the SimRank values between a source node $s$ and all other nodes. Due to its high computational complexity, single-source SimRank computation for large graphs is notoriously challenging, and hence recent studies resort to distributed processing. To our surprise, although SimRank has been widely adopted for two decades, theoretical aspects of distributed SimRank computation with provable results have rarely been studied. In this paper, we conduct a theoretical study on single-source SimRank computation in the Massively Parallel Computation (MPC) model, which is the standard theoretical framework modeling distributed systems such as MapReduce, Hadoop, or Spark. Existing distributed SimRank algorithms require either $\Omega(\log n)$ communication rounds or $\Omega(n)$ machine space for a graph of $n$ nodes. We overcome this barrier. In particular, given a graph of $n$ nodes, for any query node $v$ and constant error $\epsilon > \frac{3}{n}$, we show that $O(\log^2 \log n)$ rounds of communication among machines are almost enough to compute single-source SimRank values with at most $\epsilon$ absolute error, while each machine only needs space sub-linear in $n$. To the best of our knowledge, this is the first single-source SimRank algorithm in MPC that overcomes the $\Theta(\log n)$ round-complexity barrier with provable result accuracy.
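
    The abstract does not spell out the MPC algorithm itself; as a point of reference for what single-source SimRank computes, the sketch below gives a minimal sequential Monte Carlo estimator based on the classic reverse random-walk ("random surfer-pair") interpretation of SimRank, under which $s(u,v)$ is the expected value of $c^t$ for the first meeting time $t$ of two reverse walks. Function and parameter names (`monte_carlo_single_source_simrank`, `num_walks`, `max_steps`) are illustrative, and this is not the distributed algorithm proposed in the paper.

```python
import random

def monte_carlo_single_source_simrank(in_neighbors, source, c=0.6,
                                      num_walks=1000, max_steps=10):
    """Estimate SimRank s(source, v) for every node v by simulating pairs of
    reverse random walks and averaging c**t over their first meeting time t.
    `in_neighbors` maps each node to a list of its in-neighbors."""
    scores = {}
    for v in in_neighbors:
        if v == source:
            scores[v] = 1.0                    # s(u, u) = 1 by definition
            continue
        total = 0.0
        for _ in range(num_walks):
            a, b = source, v
            for t in range(1, max_steps + 1):
                if not in_neighbors[a] or not in_neighbors[b]:
                    break                      # a walk got stuck: no meeting
                a = random.choice(in_neighbors[a])
                b = random.choice(in_neighbors[b])
                if a == b:                     # first meeting at step t
                    total += c ** t
                    break
        scores[v] = total / num_walks
    return scores

# Toy usage on a 4-node graph given by in-neighbor lists.
if __name__ == "__main__":
    g = {0: [2, 3], 1: [2, 3], 2: [0], 3: [1]}
    print(monte_carlo_single_source_simrank(g, source=0))
```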

    Learning-Based Approaches for Graph Problems: A Survey

    Over the years, many graph problems, particularly NP-complete ones, have been studied by a wide range of researchers. Well-known examples include graph colouring, the travelling salesman problem, and subgraph isomorphism. These problems are typically addressed by exact algorithms, approximation algorithms, and heuristics, each of which has its drawbacks. Recent studies have employed learning-based frameworks, such as machine learning techniques, to solve these problems, since such frameworks are effective at discovering patterns in structured data that can be represented as graphs. This research direction has attracted a considerable amount of attention. In this survey, we provide a systematic review focused on classic graph problems for which learning-based approaches have been proposed. We give an overview of each framework and analyse its design and performance. We also suggest potential research questions. Ultimately, this survey offers clearer insight into the field and can serve as a stepping stone for the research community studying these problems. Comment: v1: 41 pages; v2: 40 pages
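
    As a concrete point of contrast, the classical heuristics that such learning-based approaches are usually compared against can be very simple; for graph colouring (one of the problems named above), a standard greedy heuristic is sketched below. The function name `greedy_colouring` and the fixed node ordering are illustrative; learned approaches typically try to pick a better ordering or assignment than this baseline.

```python
def greedy_colouring(adj):
    """Assign each node the smallest colour not used by any already-coloured
    neighbour. `adj` maps each node to an iterable of its neighbours."""
    colour = {}
    for node in adj:                  # fixed iteration order; result quality
        used = {colour[n] for n in adj[node] if n in colour}  # depends on it
        c = 0
        while c in used:              # smallest colour not taken by neighbours
            c += 1
        colour[node] = c
    return colour

# Usage on a triangle with a pendant node.
if __name__ == "__main__":
    g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    print(greedy_colouring(g))        # {0: 0, 1: 1, 2: 2, 3: 0}
```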

    Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads

    LSM-trees are widely adopted as the storage backend of key-value stores. However, optimizing system performance under dynamic workloads has not been sufficiently studied or evaluated in previous work. To fill this gap, we present RusKey, a key-value store with the following new features: (1) RusKey is a first attempt to orchestrate LSM-tree structures online to enable robust performance under dynamic workloads; (2) RusKey is the first study to use Reinforcement Learning (RL) to guide LSM-tree transformations; (3) RusKey includes a new LSM-tree design, named FLSM-tree, for an efficient transition between different compaction policies -- the bottleneck of dynamic key-value stores. We justify the superiority of the new design with theoretical analysis; (4) RusKey requires no prior workload knowledge for system adjustment, in contrast to state-of-the-art techniques. Experiments show that RusKey exhibits strong performance robustness across diverse workloads, achieving up to 4x better end-to-end performance than the RocksDB system under various settings. Comment: 25 pages, 13 figures
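
    The abstract does not describe RusKey's RL formulation in detail; purely as an illustration of the general idea of choosing among compaction policies from observed performance, the sketch below uses a simple epsilon-greedy bandit over a few hypothetical policy names with a stubbed throughput measurement. It is not RusKey's algorithm or its FLSM-tree design.

```python
import random

class CompactionPolicyBandit:
    """Epsilon-greedy selection among compaction policies based on observed
    throughput. A generic illustration, not RusKey's RL formulation."""

    def __init__(self, policies, epsilon=0.1):
        self.policies = list(policies)
        self.epsilon = epsilon
        self.counts = {p: 0 for p in self.policies}
        self.avg_reward = {p: 0.0 for p in self.policies}

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(self.policies)                      # explore
        return max(self.policies, key=lambda p: self.avg_reward[p])  # exploit

    def update(self, policy, reward):
        self.counts[policy] += 1
        n = self.counts[policy]
        # incremental mean of the observed reward (e.g. ops/sec over a window)
        self.avg_reward[policy] += (reward - self.avg_reward[policy]) / n

def measure_throughput(policy):
    """Stub: a real system would run a workload window under `policy` and
    report ops/sec; here the numbers are faked to keep the example runnable."""
    base = {"leveling": 1.0, "tiering": 1.2, "lazy-leveling": 1.1}[policy]
    return base + random.gauss(0, 0.05)

if __name__ == "__main__":
    agent = CompactionPolicyBandit(["leveling", "tiering", "lazy-leveling"])
    for _ in range(200):
        p = agent.choose()
        agent.update(p, measure_throughput(p))
    print(agent.avg_reward)
```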

    DMCS: Density Modularity based Community Search

    Community Search, or finding a connected subgraph (known as a community) containing the given query nodes in a social network, is a fundamental problem. Most existing community search models focus only on the internal cohesiveness of a community. However, a high-quality community often has high modularity, meaning dense connections inside the community and sparse connections to nodes outside it. In this paper, we conduct a pioneering study on searching for a community with high modularity. We point out that while modularity has been widely used in community detection (without query nodes), it has, surprisingly, not been adopted for community search, and its application to community search (related to query nodes) brings new challenges. We address these challenges by designing a new graph modularity function named Density Modularity. To the best of our knowledge, this is the first work on the community search problem using graph modularity. Community search based on density modularity, termed DMCS, is to find a community in a social network that contains all the query nodes and has high density modularity. We prove that the DMCS problem is NP-hard. To address DMCS efficiently, we present new algorithms that run in time log-linear in the graph size. We conduct extensive experimental studies on real-world and synthetic networks, which offer insights into the efficiency and effectiveness of our algorithms. In particular, our algorithm achieves up to 8.5 times higher accuracy in terms of NMI than baseline algorithms.
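
    The exact definition of Density Modularity is not reproduced in the abstract; as a reference point, the classical single-community (Newman-style) modularity score that it refines can be computed for a candidate node set as sketched below. The function name `community_modularity` is illustrative, and the formula is the textbook one, not the paper's density-modularity objective.

```python
def community_modularity(adj, community):
    """Classical single-community modularity: Q_S = e_S/m - (d_S/(2m))**2,
    where e_S is the number of internal edges of S, d_S the total degree of S,
    and m the number of edges in the whole graph. `adj` maps node -> neighbours
    in an undirected graph."""
    S = set(community)
    m = sum(len(neigh) for neigh in adj.values()) / 2       # total edge count
    e_s = sum(1 for u in S for v in adj[u] if v in S) / 2    # internal edges
    d_s = sum(len(adj[u]) for u in S)                        # total degree of S
    return e_s / m - (d_s / (2 * m)) ** 2

if __name__ == "__main__":
    g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
    print(community_modularity(g, {0, 1, 2}))   # dense triangle scores high
```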

    SCARA: Scalable Graph Neural Networks with Feature-Oriented Optimization

    Recent advances in data processing have stimulated the demand for learning on graphs of very large scale. Graph Neural Networks (GNNs), an emerging and powerful approach to graph learning tasks, are known to be difficult to scale up. Most scalable models apply node-based techniques to simplify the expensive message-passing propagation procedure of GNNs. However, we find such acceleration insufficient when applied to million- or even billion-scale graphs. In this work, we propose SCARA, a scalable GNN with feature-oriented optimization for graph computation. SCARA efficiently computes graph embeddings from node features, and further selects and reuses feature computation results to reduce overhead. Theoretical analysis indicates that our model achieves sub-linear time complexity with guaranteed precision in the propagation process as well as in GNN training and inference. We conduct extensive experiments on various datasets to evaluate the efficacy and efficiency of SCARA. Performance comparison with baselines shows that SCARA achieves up to 100x faster graph propagation than current state-of-the-art methods, with fast convergence and comparable accuracy. Most notably, it completes the precomputation on the largest available billion-scale GNN dataset, Papers100M (111M nodes, 1.6B edges), in 100 seconds.
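
    SCARA's feature-oriented optimization is not detailed in the abstract; a common backbone of this family of scalable GNNs is decoupled, personalized-PageRank-style propagation applied directly to the node feature matrix, a dense-matrix version of which is sketched below. The function name and parameters are illustrative, and a billion-scale implementation would of course use sparse or push-based computation rather than dense matrices.

```python
import numpy as np

def ppr_feature_propagation(adj, features, alpha=0.15, num_iters=10):
    """Decoupled personalized-PageRank-style propagation of a node feature
    matrix: Z <- alpha * X + (1 - alpha) * A_hat @ Z, iterated num_iters times.
    `adj` is a dense (n, n) adjacency matrix, `features` an (n, d) matrix."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                          # avoid division by zero
    a_hat = adj / deg                            # row-normalized adjacency
    z = features.copy()
    for _ in range(num_iters):
        z = alpha * features + (1 - alpha) * a_hat @ z
    return z

if __name__ == "__main__":
    adj = np.array([[0, 1, 1, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
    x = np.eye(4)                                # one-hot features for the demo
    print(ppr_feature_propagation(adj, x).round(3))
```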

    Label Propagation for Graph Label Noise

    Label noise is a common challenge in large datasets, as it can significantly degrade the generalization ability of deep neural networks. Most existing studies focus on noisy labels in computer vision; however, graph models take both node features and graph topology as input and become more susceptible to label noise through message-passing mechanisms. Recently, only a few works have been proposed to tackle label noise on graphs. One major limitation is that they assume the graph is homophilous and the labels are smoothly distributed. Nevertheless, real-world graphs may contain varying degrees of heterophily or even be heterophily-dominated, rendering current methods inadequate. In this paper, we study graph label noise in the context of arbitrary heterophily, with the aim of rectifying noisy labels and assigning labels to previously unlabeled nodes. We begin by conducting two empirical analyses to explore the impact of graph homophily on graph label noise. Following these observations, we propose a simple yet efficient algorithm, denoted LP4GLN. Specifically, LP4GLN is an iterative algorithm with three steps: (1) reconstruct the graph to recover the homophily property, (2) use label propagation to rectify the noisy labels, and (3) retain high-confidence labels for the next iteration. By iterating these steps, we obtain a set of correct labels, ultimately achieving high accuracy in the node classification task. We also provide a theoretical analysis to demonstrate the remarkable denoising "effect" of LP4GLN. Finally, we conduct experiments on 10 benchmark datasets under varying graph heterophily levels and noise types, comparing the performance of LP4GLN with 7 typical baselines. Our results illustrate the superior performance of the proposed LP4GLN.
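
    To illustrate the label-propagation primitive behind steps (2) and (3) of the loop described above (not the paper's full LP4GLN algorithm; the function name, threshold, and hyperparameters are illustrative), a minimal sketch might look like the following.

```python
import numpy as np

def propagate_and_rectify(adj, labels, num_classes, alpha=0.9,
                          num_iters=20, conf_threshold=0.7):
    """Standard label propagation F <- alpha * A_hat @ F + (1 - alpha) * Y,
    followed by keeping only high-confidence predictions. `labels` uses -1
    for unlabeled (or distrusted) nodes; `adj` is a dense adjacency matrix."""
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    a_hat = adj / deg                               # row-normalized adjacency
    y = np.zeros((n, num_classes))
    for i, lab in enumerate(labels):
        if lab >= 0:
            y[i, lab] = 1.0                         # one-hot for observed labels
    f = y.copy()
    for _ in range(num_iters):
        f = alpha * a_hat @ f + (1 - alpha) * y
    conf = f.max(axis=1) / f.sum(axis=1).clip(min=1e-12)
    preds = f.argmax(axis=1)
    return np.where(conf >= conf_threshold, preds, -1)  # -1: still undecided

if __name__ == "__main__":
    adj = np.array([[0, 1, 1, 0, 0],
                    [1, 0, 1, 0, 0],
                    [1, 1, 0, 1, 0],
                    [0, 0, 1, 0, 1],
                    [0, 0, 0, 1, 0]], dtype=float)
    noisy_labels = np.array([0, 0, -1, 1, -1])      # nodes 2 and 4 unlabeled
    print(propagate_and_rectify(adj, noisy_labels, num_classes=2))
```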